search data
Compressing Search with Language Models
Mulc, Thomas, Steele, Jennifer L.
Millions of people turn to Google Search each day for information on things as diverse as new cars or flu symptoms. The terms that they enter contain valuable information on their daily intent and activities, but the information in these search terms has been difficult to fully leverage. User-defined categorical filters have been the most common way to shrink the dimensionality of search data to a tractable size for analysis and modeling. In this paper we present a new approach to reducing the dimensionality of search data while retaining much of the information in the individual terms without user-defined rules. Our contributions are two-fold: 1) we introduce SLaM Compression, a way to quantify search terms using pre-trained language models and create a representation of search data that has low dimensionality, is memory efficient, and effectively acts as a summary of search, and 2) we present CoSMo, a Constrained Search Model for estimating real world events using only search data. We demonstrate the efficacy of our contributions by estimating with high accuracy U.S. automobile sales and U.S. flu rates using only Google Search data.
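As a rough illustration of the idea of summarizing many search terms in a low-dimensional vector, the sketch below takes a volume-weighted mean of term embeddings. This is an assumption made for illustration only: the abstract does not state how SLaM Compression aggregates terms. The names `TOY_EMBEDDINGS` and `summarize_searches` are hypothetical, and the three-dimensional vectors stand in for the output of a real pre-trained language model.

```python
from typing import Dict, List

# Hypothetical toy vectors standing in for embeddings that a pre-trained
# language model would produce for each search term.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "new car prices": [1.0, 0.0, 0.0],
    "used car dealer": [0.8, 0.2, 0.0],
    "flu symptoms": [0.0, 1.0, 0.0],
}


def summarize_searches(
    volumes: Dict[str, float],
    embeddings: Dict[str, List[float]],
) -> List[float]:
    """Return the volume-weighted mean embedding of the searched terms.

    Assumed aggregation for illustration; not confirmed by the source.
    Terms without an embedding are ignored.
    """
    terms = [t for t in volumes if t in embeddings]
    total = sum(volumes[t] for t in terms)
    if total <= 0:
        raise ValueError("no search volume for terms with known embeddings")

    dim = len(embeddings[terms[0]])
    summary = [0.0] * dim
    for term in terms:
        weight = volumes[term] / total  # share of total search volume
        for i, value in enumerate(embeddings[term]):
            summary[i] += weight * value
    return summary


if __name__ == "__main__":
    day_volumes = {"new car prices": 1.0, "flu symptoms": 3.0}
    print(summarize_searches(day_volumes, TOY_EMBEDDINGS))
```

However many distinct terms are searched on a given day, the summary keeps the dimensionality of the embedding, which is the property the abstract emphasizes.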
Detecting Elevated Air Pollution Levels by Monitoring Web Search Queries: Deep Learning-Based Time Series Forecasting
Lin, Chen, Yousefi, Safoora, Kahoro, Elvis, Karisani, Payam, Liang, Donghai, Sarnat, Jeremy, Agichtein, Eugene
Real-time air pollution monitoring is a valuable tool for public health and environmental surveillance. In recent years, there has been a dramatic increase in air pollution forecasting and monitoring research using artificial neural networks (ANNs). Most of the prior work relied on modeling pollutant concentrations collected from ground-based monitors and meteorological data for long-term forecasting of outdoor ozone, oxides of nitrogen, and PM2.5. Given that traditional, highly sophisticated air quality monitors are expensive and are not universally available, these models cannot adequately serve those not living near pollutant monitoring sites. Furthermore, because prior models were built on physical measurement data collected from sensors, they may not be suitable for predicting public health effects experienced from pollution exposure. This study aims to develop and validate models to nowcast the observed pollution levels using Web search data, which is publicly available in near real-time from major search engines. We developed novel machine learning-based models using both traditional supervised classification methods and state-of-the-art deep learning methods to detect elevated air pollution levels at the US city level, by using generally available meteorological data and aggregate Web-based search volume data derived from Google Trends. We validated the performance of these methods by predicting three critical air pollutants (ozone (O3), nitrogen dioxide (NO2), and fine particulate matter (PM2.5)), across ten major U.S. metropolitan statistical areas (MSAs) in 2017 and 2018.
Multiwave COVID-19 Prediction via Social Awareness-Based Graph Neural Networks using Mobility and Web Search Data
Xue, J., Yabe, T., Tsubouchi, K., Ma, J., Ukkusuri, S. V.
Recurring outbreaks of COVID-19 have posed enduring effects on global society, which calls for a predictor of pandemic waves using various data with early availability. Existing prediction models that forecast the first outbreak wave using mobility data may not be applicable to the multiwave prediction, because the evidence in the USA and Japan has shown that mobility patterns across different waves exhibit varying relationships with fluctuations in infection cases. Therefore, to predict the multiwave pandemic, we propose a Social Awareness-Based Graph Neural Network (SAB-GNN) that considers the decay of symptom-related web search frequency to capture the changes in public awareness across multiple waves. SAB-GNN combines GNN and LSTM to model the complex relationships among urban districts, inter-district mobility patterns, web search history, and future COVID-19 infections. We train our model to predict future pandemic outbreaks in the Tokyo area using its mobility and web search data from April 2020 to May 2021 across four pandemic waves collected by _ANONYMOUS_COMPANY_ under strict privacy protection rules. Results show our model outperforms other baselines including ST-GNN and MPNN+LSTM. Though our model is not computationally expensive (only 3 layers and 10 hidden neurons), the proposed model enables public agencies to anticipate and prepare for future pandemic outbreaks.
The Proper Use of Google Trends in Forecasting Models
Medeiros, Marcelo C., Pires, Henrique F.
It is widely known that Google Trends has become one of the most popular free tools used by forecasters in academia and in the private and public sectors. There are many papers, from several different fields, concluding that Google Trends improves forecast accuracy. However, what seems to be far less widely known is that each sample of Google search data differs from the others, even if you set the same search term, date and location. This means that it is possible to reach arbitrary conclusions merely by chance. This paper aims to show why and when this can become a problem and how to overcome this obstacle.
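To make the sampling issue concrete, the sketch below combines several downloads of the same query into a per-period mean and reports the spread across downloads. Averaging repeated samples is one common mitigation and is shown here as an assumption; the abstract does not say which remedy the paper recommends. The function name `combine_samples` and the example values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def combine_samples(samples: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Combine repeated downloads of one search-interest series.

    Each inner list is one download covering the same periods.
    Returns (per-period mean, per-period population standard deviation).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("all samples must cover the same periods")

    # zip(*samples) groups the values that belong to the same period.
    columns = list(zip(*samples))
    averaged = [mean(col) for col in columns]
    spread = [pstdev(col) for col in columns]
    return averaged, spread


if __name__ == "__main__":
    # Hypothetical values: three downloads of the same term, date range
    # and location that disagree in the first period.
    downloads = [[50.0, 100.0], [54.0, 100.0], [52.0, 100.0]]
    averaged, spread = combine_samples(downloads)
    print(averaged, spread)
```

A large spread in a given period suggests that a result based on any single download could be driven by sampling noise.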
Zero-Shot Heterogeneous Transfer Learning from Recommender Systems to Cold-Start Search Retrieval
Wu, Tao, Chio, Ellie Ka-In, Cheng, Heng-Tze, Du, Yu, Rendle, Steffen, Kuzmin, Dima, Agarwal, Ritesh, Zhang, Li, Anderson, John, Singh, Sarvjeet, Chandra, Tushar, Chi, Ed H., Li, Wen, Kumar, Ankit, Ma, Xiang, Soares, Alex, Jindal, Nitin, Cao, Pei
Many recent advances in neural information retrieval models, which predict top-K items given a query, learn directly from a large training set of (query, item) pairs. However, they are often insufficient when there are many previously unseen (query, item) combinations, often referred to as the cold start problem. Furthermore, the search system can be biased towards items that are frequently shown to a query previously, also known as the 'rich get richer' (a.k.a. feedback loop) problem. In light of these problems, we observed that most online content platforms have both a search and a recommender system that, while having heterogeneous input spaces, can be connected through their common output item space and a shared semantic representation. In this paper, we propose a new Zero-Shot Heterogeneous Transfer Learning framework that transfers learned knowledge from the recommender system component to improve the search component of a content platform. First, it learns representations of items and their natural-language features by predicting (item, item) correlation graphs derived from the recommender system as an auxiliary task. Then, the learned representations are transferred to solve the target search retrieval task, performing query-to-item prediction without having seen any (query, item) pairs in training. We conduct online and offline experiments on one of the world's largest search and recommender systems from Google, and present the results and lessons learned. We demonstrate that the proposed approach can achieve high performance on offline search retrieval tasks, and more importantly, achieved significant improvements on relevance and user interactions over the highly-optimized production system in online experiments.
China and scientists dismiss study suggesting coronavirus spread in August 2019
LONDON - Beijing dismissed as "ridiculous" a Harvard Medical School study of hospital traffic and search engine data that suggested the novel coronavirus may already have been spreading in China last August, and scientists said it offered no convincing evidence of when the outbreak began. The research, which has not been peer-reviewed by other scientists, used satellite imagery of hospital parking lots in Wuhan -- where the disease was first identified in late 2019 -- and data for symptom-related queries on search engines for terms such as "cough" and "diarrhea." The study's authors said increased hospital traffic and symptom search data in Wuhan preceded the documented start of the coronavirus pandemic, in December 2019. "While we cannot confirm if the increased volume was directly related to the new virus, our evidence supports other recent work showing that emergence happened before identification at the Huanan Seafood market (in Wuhan)," they said. Paul Digard, an expert in virology at the University of Edinburgh, said that using search engine data and satellite imagery of hospital traffic to detect disease outbreaks "is an interesting idea with some validity."
China pushes back against Harvard coronavirus study
Beijing has dismissed as "ridiculous" a Harvard Medical School study of hospital traffic and search engine data that suggested the new coronavirus may already have been spreading in China last August, and scientists said it offered no convincing evidence of when the outbreak began. Chinese Foreign Ministry spokeswoman Hua Chunying, asked about the research at a news briefing on Tuesday, said: "I think it is ridiculous, incredibly ridiculous, to come up with this conclusion based on superficial observations such as traffic volume." The research, which has not been peer-reviewed by other scientists, used satellite imagery of hospital parking lots in Wuhan - where the disease was first identified in late 2019 - and data for symptom-related queries on search engines for things such as "cough" and "diarrhoea". The study's authors said increased hospital traffic and symptom search data in Wuhan preceded the documented start of the coronavirus pandemic in December 2019. "While we cannot confirm if the increased volume was directly related to the new virus, our evidence supports other recent work showing that emergence happened before identification at the Huanan Seafood market (in Wuhan)," they said.
Google tells 1.1million children that Santa doesn't exist
Do you remember the moment you found out the truth about Santa? Analysis of Google search data surrounding Santa found that, on average, 1,116,500 children ask Google "Is Santa Real" each year. In answer, the world's leading search engine displays an article whose opening sentence says "as adults we know Santa Claus isn't real". The article, written by online publisher Quartz, aims to give advice to parents on what to say when their child asks "Is Santa Real?", but the publisher doesn't realise that the opening sentence of its article is the first thing seen by over a million children worldwide, shattering their beliefs instantly. Stephen Kenwright, an expert in Google search results and Technical Search Engine Optimisation director at Rise at Seven, said that "Google is ranking this article on Quartz as the no.1 result based on the authority of the domain and reliability of the content. "Google's algorithms choose the answer which best answers the question searched, taking safety into consideration, all whilst being factually accurate.
Using Search Queries to Understand Health Information Needs in Africa
Abebe, Rediet, Hill, Shawndra, Vaughan, Jennifer Wortman, Small, Peter M., Schwartz, H. Andrew
The lack of comprehensive, high-quality health data in developing nations creates a roadblock for combating the impacts of disease. One key challenge is understanding the health information needs of people in these nations. Without understanding people's everyday needs, concerns, and misconceptions, health organizations and policymakers lack the ability to effectively target education and programming efforts. In this paper, we propose a bottom-up approach that uses search data from individuals to uncover and gain insight into health information needs in Africa. We analyze Bing searches related to HIV/AIDS, malaria, and tuberculosis from all 54 African nations. For each disease, we automatically derive a set of common search themes or topics, revealing a wide-spread interest in various types of information, including disease symptoms, drugs, concerns about breastfeeding, as well as stigma, beliefs in natural cures, and other topics that may be hard to uncover through traditional surveys. We expose the different patterns that emerge in health information needs by demographic groups (age and sex) and country. We also uncover discrepancies in the quality of content returned by search engines to users by topic. Combined, our results suggest that search data can help illuminate health information needs in Africa and inform discussions on health policy and targeted education efforts both on- and offline.
Views of AI, robots, and automation based on internet search data
Artificial intelligence, robots, and automation are rising in importance in many areas. As noted in the recent book, "The Future of Work: Robots, AI, and Automation," there are exciting advances in finance, transportation, national defense, smart cities, and health care, among other areas. Businesses are developing solutions that improve the efficiency and effectiveness of their operations and using these tools to improve the way their firms function. Yet there also are concerns about the impact of these developments on jobs and personal privacy. A Pew Research Center national survey revealed considerable unease about emerging trends.